Objective

Cluster daily environmental drivers to assess how they converge to drive spring runoff patterns from east-side Mt Mansfield watersheds

Shorten window of analysis to focus on spring melt period

# First we're going to filter the data to focus on March-May dynamics
  # We want to avoid incorporating too much of the climate pattern, which would wash out this spring signal
  # We'll want to play with the size of this window
  # For reference water year days-of-year: Mar 15 = 166; Apr 1 = 183; May 1 = 213; May 15 = 227; May 31 = 243
  mets_sub <- mets %>%
    # filter(doy_wyear >= 160 & doy_wyear <= 250)
    filter(doy_wyear >= 170 & doy_wyear <= 245)



We eventually want to look at differences between cumulative runoff yields at the end of the water year so let’s see if cumulative runoff yields at the end of the analysis window correlate with those at the end of the water year

## [1] r = 0.88
## [1] p = 0

## [1] r = 0.86
## [1] p = 0



Select variables to keep

If two variables are strongly correlated (negatively or positively) they can effectively “double-weight” a particular factor important in driving clustering; thus, keep just one of the variables to serve as a proxy for that factor

# For the analysis we only need one watershed/site of data, because the drivers we use are the same for both sites
  mets_sub_1site <- mets_sub %>% 
    filter(site == "WB")

# Select vars to keep Step 1
  # List the ID and response variables that will NOT be used in the analysis
  drop.info <- names(mets_sub_1site %>% 
                       select(# Remove the non-numerical ID/INFO
                              site, wyear, date, doy_wyear,
                              # Remove response variables
                              q_mm, q_mm_cum, q_mm_per,
                              # Remove other variables we know we won't want for SOM, but will want for graphing results later
                              precip_mm_cum, precip_mm_per))
  
  # List the other variables that will not be used in this analysis
  drop.vars <- names(mets_sub_1site %>% 
                       select(temp_max, temp_min,
                              # We'll keep the average temps measured at VWC/FEMC station, so drop those from Morrisville/Stowe airport
                              temp_avg,
                              # Let's just use the cumulative total volume of water diverted from the stream by the resort
                              use_snow_mm_daily, use_total_mm_daily,
                              # After looking at an analysis, I'm going to drop the use categories altogether,
                              use_total_mm_cum))  

# Look at correlations between variables
  mets_sub_1site %>% 
    # Remove the vars listed above
    select(-one_of(drop.info, drop.vars)) %>% 
    # Keep only complete observations/rows (no NAs in any of the columns)
    na.omit() %>% 
    # Visualize these
    lares::corr_cross(max_pvalue = 0.05, top = 20)

Not too worried about correlations

Self-organizing map (SOM)

Prepare data & set up grid/lattice dimensions

We’re only using complete observations/rows (no NAs in any columns)
According to the heuristic rule from Vesanto 2000, number of grid elements/grid size/nodes = 5 * sqrt(n)
To determine the the shape of the grid (ratio of columns to rows), we use the ratio of the first two eigen values of the input data set as recommended by Park et al. 2006

## [1] No. of complete observations: 1444 out of 1444 observations
## [1] No. of Vesanto nodes: 190
## [1] Ratio of columns to rows: 1.3



Run SOM for a suite of grid/lattice configurations, # of nodes, and # of clusters

Code courtesy of Kristen Underwood (hidden)

## [1] Topology: hexagonal
## [1] Data normalization method used: L2norm
## [1] Weighting method used: noPCA
## [1] No. of iterations: 1500
## [1] alphaCrs used: 0.05
## [1] alphaFin used: 1500



Choose the best SOM run based on non-parametric F-stat and quantization error

We want to maximize npF (ratio of b/w cluster variance) and minimize QE (mean distance b/w each data vector & best-matching unit)
Here are the top 50% of runs based on npF

Choose the best run from each high performing (top 50%) cluster
## [1] These were the top runs for each high-performing cluster #:

Run rows cols Nodes Clusters npF QE
9 12 16 192 3 1767.881 0.01
17 13 15 195 4 1716.028 0.01
22 13 15 195 5 1592.269 0.01


To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’



3 clusters - boxplots of independent variables by cluster





3 clusters - here’s how the clusters map on to the cumulative runoff curves



4 clusters - boxplots of independent variables by cluster





4 clusters - here’s how the clusters map on to the cumulative runoff curves



4 clusters - here’s how the window of analysis maps onto the entire cumulative runoff curves



4 clusters - exploratory data analysis with SOM results

Does the distribution of clusters across the spring window of analysis explain spring runoff differences?



Does the number of warm days with greater than average precip explain the spring runoff difference?

## [1] R2 = 0.12
## [1] p = 0.078

## [1] R2 = -0.04
## [1] p = 0.57

## [1] R2 = -0.05
## [1] p = 0.741



The runoff diff ~ number of days as cluster 4 is promising, but not perfect; what about total precip during this period?

## [1] R2 = -0.03
## [1] p = 0.505



For many years, the divergence occurs after switch from colder to warmer days (from cluster 1 to 2)
Does focusing on the warmer period yield any insight?

Here I use a running mode to identify when the switch occurs from predominantly cluster 1 to 2 or 4 (cold to warm transition)
Transition is plotted as the vertical black line
Note: the running mode (width = 3) worked well for all years but 2017; I shifted that one to first occurrence of cluster 2



## [1] R2 = 0.14
## [1] p = 0.063

## [1] R2 = 0.07
## [1] p = 0.147

## [1] R2 = 0.15
## [1] p = 0.055

## [1] R2 = 0.02
## [1] p = 0.244

## [1] R2 = 0.08
## [1] p = 0.121

## [1] R2 = -0.01
## [1] p = 0.383

## [1] R2 = 0.21
## [1] p = 0.029

## [1] R2 = 0.37
## [1] p = 0.004

## [1] R2 = 0.08
## [1] p = 0.132



5 clusters - examine boxplots of independent variables by cluster





5 clusters - examine how the clusters map on to the cumulative runoff curves